Introduction

blablabla

Data description and preprocessing

The datasets used in the analysis below were sourced from Kaggle (www.kaggle.com) 1. They were compiled from several sources, including the Bureau of Justice Statistics 2 and the FBI Uniform Crime Reporting (UCR) Program 3. The National Prisoner Statistics Program, conducted by the Bureau of Justice Statistics, has collected data on the number of prisoners in state and federal prison facilities since 1926. The data are produced annually at the national and state levels and are sourced from the 50 state departments of correction, the Federal Bureau of Prisons and, until 2001, the District of Columbia. The UCR Program provides statistics on violent and property crimes. These data are collected annually and are available at the national, state and city levels. For the purposes of our analysis we use state-level statistics.

Additionally, we separately collected data on federal expenditures (direct expenditures; corrections) on correctional institutions, provided by the Bureau of Justice Statistics for each state between 2009 and 2016 4. Correctional institutions include prisons and penitentiaries, reformatories, jails, houses of correction, other named correctional institutions (correctional farms, workhouses, industrial schools, and training schools), institutions and facilities exclusively for the confinement of the criminally insane, institutions and facilities for the examination, evaluation, classification, and assignment of inmates, and facilities for the confinement, treatment, and rehabilitation of persons with drug or alcohol use disorders, if the institution is administered by a correctional agency. Later in the analysis we correlate these expenditures with the occurrence of particular crimes. Apart from the total number of incarcerated prisoners, we also collected information on the number of prisoners in private prisons across several states for the same years as the prison expenditures [^prisoners].

UCR

The UCR dataset consists of 15 variables, two of which are the jurisdiction and the year of the observation. It provides the state population and the yearly number of violent crimes (murder, manslaughter, rape, robbery, aggravated assault) and property crimes (burglary, larceny, vehicle theft) per state. Detailed definitions of each crime can be found on the UCR Program website.

The crime_reporting_change variable reflects instances when states’ reporting standards changed. The crimes_estimated variable indicates cases where the FBI computed estimates for participating agencies that did not provide 12 months of complete data for the state 5.

ucr <- read_csv("data/ucr_by_state.csv")
ucr$year <- as.factor(ucr$year)

All column names of the UCR dataset are listed below.

colnames(ucr)
##  [1] "jurisdiction"           "year"                  
##  [3] "crime_reporting_change" "crimes_estimated"      
##  [5] "state_population"       "violent_crime_total"   
##  [7] "murder_manslaughter"    "rape_legacy"           
##  [9] "rape_revised"           "robbery"               
## [11] "agg_assault"            "property_crime_total"  
## [13] "burglary"               "larceny"               
## [15] "vehicle_theft"          "X16"                   
## [17] "X17"                    "X18"                   
## [19] "X19"                    "X20"                   
## [21] "X21"

The UCR dataset has many missing values, whereas the other datasets have none. We dropped the last 6 columns, which were completely empty, and then dropped the rows consisting only of missing values. This leaves every column complete apart from “rape_revised”, with 612 missing values, and “rape_legacy”, with 104 missing values.

# removing last 6 columns
ucr <- ucr[, -c(16:21)]
# removing all missing rows
ind <- apply(ucr, 1, function(x) all(is.na(x)))
ucr <- ucr[ !ind, ]
# showing sum of missing values per columns
sapply(ucr, function(x) sum(is.na(x)))
##           jurisdiction                   year crime_reporting_change 
##                      0                      0                      0 
##       crimes_estimated       state_population    violent_crime_total 
##                      0                      0                      0 
##    murder_manslaughter            rape_legacy           rape_revised 
##                      0                    104                    612 
##                robbery            agg_assault   property_crime_total 
##                      0                      0                      0 
##               burglary                larceny          vehicle_theft 
##                      0                      0                      0
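The same cleaning idea can be sketched without hard-coded column positions. The following is a minimal illustration on a made-up data frame (`toy` and its values are hypothetical, not the UCR data): it drops columns and rows that contain only missing values and then counts the missing values that remain in each column.

```r
# Toy data standing in for the raw file; names and values are hypothetical.
toy <- data.frame(
  jurisdiction = c("State A", "State B", NA),
  burglary     = c(10, NA, NA),
  X3           = c(NA, NA, NA)   # a completely empty column
)

# Drop columns in which every value is missing.
empty_cols <- colSums(!is.na(toy)) == 0
toy <- toy[, !empty_cols, drop = FALSE]

# Drop rows in which every value is missing.
empty_rows <- rowSums(!is.na(toy)) == 0
toy <- toy[!empty_rows, , drop = FALSE]

# Count the remaining missing values per column.
na_per_col <- colSums(is.na(toy))
print(na_per_col)
```

Detecting empty columns by content rather than by index is a design choice that keeps the step valid if the number of trailing empty columns in the file ever changes.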

As the left plot below shows, the last two years, 2016 and 2017, contain one additional observation, i.e. one additional jurisdiction. The right plot shows that New York is missing in one year and Puerto Rico appears in only 3 years. The District of Columbia is sometimes named DC, but the two names together cover all 17 years.

library(viridis)
plot.data1 = ucr %>% group_by(year) %>% count()
ggp1 = ggplot(data = plot.data1, aes(x=year, y=n, fill=year)) + 
  geom_bar(stat = "identity") +
  scale_fill_viridis_d() +
  scale_x_discrete(breaks = as.factor(seq(2001, 2017,2))) +
  theme_minimal() + 
  theme(axis.title.x = element_blank(), 
        axis.title.y = element_blank(),
        legend.position = "none")

plot.data2 = ucr %>% group_by(jurisdiction) %>% count() %>% arrange(n) %>% filter(n<17)
ggp2 = ggplot(data = plot.data2, aes(x=jurisdiction, y=n, fill=jurisdiction)) + 
  geom_bar(stat = "identity") +
  theme_minimal() + 
  scale_fill_viridis_d() +
  theme(axis.title.x = element_blank(), 
        axis.title.y = element_blank(),
        legend.position = "none")

grid.arrange(ggp1, ggp2, ncol = 2)

Based on the above analysis, we decided to rename “DC” to “District of Columbia” and to exclude Puerto Rico.

ucr$jurisdiction[ucr$jurisdiction=="DC"] <- "District of Columbia"
ucr <- ucr %>% filter(jurisdiction!="Puerto Rico")
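After such a recoding it can be useful to confirm that no jurisdiction is left with an incomplete set of years. The sketch below uses a small made-up data frame (`toy`, with two years instead of 17, and hypothetical rows) to illustrate the check; it is not run on the actual UCR data.

```r
# Hypothetical rows; in this toy example each jurisdiction should have 2 years.
toy <- data.frame(
  jurisdiction = c("DC", "District of Columbia", "Puerto Rico", "Ohio", "Ohio"),
  year         = c(2001, 2002, 2002, 2001, 2002),
  stringsAsFactors = FALSE
)
expected_years <- 2

# Harmonise the two spellings and drop the jurisdiction we exclude.
toy$jurisdiction[toy$jurisdiction == "DC"] <- "District of Columbia"
toy <- toy[toy$jurisdiction != "Puerto Rico", ]

# Count observations per jurisdiction and list any that are still incomplete.
obs_per_jurisdiction <- table(toy$jurisdiction)
incomplete <- names(obs_per_jurisdiction)[obs_per_jurisdiction < expected_years]
print(obs_per_jurisdiction)
print(incomplete)
```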

We also analysed the missing values of the variables rape_revised and rape_legacy. Because so many values are missing and the two variables mostly do not cover the same years, we cannot compare them, so we decided to drop both. The number of observations per year for both variables is shown in the table below.

rape_df <- data.frame(year=as.factor(2001:2017))

rape_revised_count <- ucr[!is.na(ucr$rape_revised),] %>% 
                            group_by(year) %>% 
                            count(name="rape_revised_count")
rape_legacy_count <- ucr[!is.na(ucr$rape_legacy),] %>% 
                            group_by(year) %>% 
                            count(name="rape_legacy_count")
rape_df <- left_join(rape_df, rape_revised_count, by="year")
rape_df <- left_join(rape_df, rape_legacy_count, by="year")

kable_f(rape_df)
year rape_revised_count rape_legacy_count
2001 NA 51
2002 NA 51
2003 NA 51
2004 NA 51
2005 NA 51
2006 NA 51
2007 NA 51
2008 NA 51
2009 NA 51
2010 NA 51
2011 NA 51
2012 NA 51
2013 51 51
2014 51 51
2015 50 50
2016 51 NA
2017 51 NA

ucr$rape_legacy <- NULL
ucr$rape_revised <- NULL

These are the final columns that are in the UCR dataset.

colnames(ucr)
##  [1] "jurisdiction"           "year"                  
##  [3] "crime_reporting_change" "crimes_estimated"      
##  [5] "state_population"       "violent_crime_total"   
##  [7] "murder_manslaughter"    "robbery"               
##  [9] "agg_assault"            "property_crime_total"  
## [11] "burglary"               "larceny"               
## [13] "vehicle_theft"

The plots below present the distributions of the number of crimes for each crime type. The values are mainly concentrated near 0. There are, however, a few spikes, visible especially for robbery, burglary and larceny, which indicate outliers. For larceny we can also observe two peaks, which suggests a multimodal distribution.

pl <- vector("list", length = ncol(ucr[,c(5:13)])-1)
colors <- viridis(8)
for(ii in seq_along(pl)) {
  .col <- colnames(ucr[,c(5:13)])[-1][ii]
  .p <- ggplot(ucr, aes_string(x=.col, fill="colors[ii]", color="colors[ii]")) + 
          geom_density(alpha=0.3) + 
          scale_fill_manual(values = colors[ii], aesthetics = c("color", "fill")) +
          theme_minimal() +           
          theme(legend.position = "none",
                axis.title.x = element_blank(),
                axis.title.y = element_blank()) +
          labs(title = .col) + 
          scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) + 
          scale_x_continuous(labels = function(x) format(x, scientific = FALSE)) 
  
  pl[[ii]] <- .p
}

grid.arrange(grobs=pl, ncol=2)

Incarcerations in prison

Unlike the UCR data, the prison data come in a wide panel form, with the years from 2001 to 2016 as columns. Using long_panel, we converted the data frame so that each row corresponds to one jurisdiction and year. The dataset also indicates whether the number of prisoners includes the jail population.

prison <- read_csv("data/prison_custody_by_state.csv")

colnames(prison)[3:18] <- paste0(colnames(prison)[3:18],'1')
prison <- long_panel(prison, begin = 2001, end = 2016, label_location = "beginning", id = "jurisdiction")
names(prison)[names(prison) == "wave"] <- "year"
names(prison)[names(prison) == "1"] <- "prison"
prison$year <- as.factor(prison$year)

kable_f(head(prison))
jurisdiction year includes_jails prison
Alabama 2001 0 24,741
Alabama 2002 0 25,100
Alabama 2003 0 27,614
Alabama 2004 0 25,635
Alabama 2005 0 24,315
Alabama 2006 0 24,103
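For readers unfamiliar with this kind of reshaping, the wide-to-long conversion can be illustrated on a toy example. The sketch below uses `tidyr::pivot_longer` as a stand-in for `long_panel` (a swapped-in technique, not the call used in this report), and `prison_wide` with its values is entirely hypothetical.

```r
library(tidyr)

# Hypothetical wide data: one row per jurisdiction, one column per year.
prison_wide <- data.frame(
  jurisdiction   = c("State A", "State B"),
  includes_jails = c(0, 1),
  `2001`         = c(100, 200),
  `2002`         = c(110, 190),
  check.names    = FALSE   # keep the year column names as they are
)

# Stack the year columns so that each row is one jurisdiction-year pair.
prison_long <- pivot_longer(
  prison_wide,
  cols      = c("2001", "2002"),
  names_to  = "year",
  values_to = "prison"
)
prison_long$year <- as.factor(prison_long$year)
print(prison_long)
```

The result has one row per jurisdiction and year, which is the same shape as the table shown for the converted prison data.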

Number of prisoners in private prisons

Like the prison data, the private_prisons data come in a wide panel form and are transformed using long_panel. Additionally, we created a column showing the percentage of prisoners held in private prisons among all incarcerated.

private_prisons <- read_delim("data/private_prisons.csv", ";", na="/")

colnames(private_prisons)[2:9] <- paste0(colnames(private_prisons)[2:9],'1')
private_prisons <- long_panel(private_prisons, begin = 2009, end = 2016, label_location = "beginning", id = "jurisdiction")
names(private_prisons)[names(private_prisons) == "wave"] <- "year"
names(private_prisons)[names(private_prisons) == "1"] <- "private_prisons"
private_prisons$year <- as.factor(private_prisons$year)
#creating new column
private_prisons <- left_join(private_prisons, prison, by=c("state"="jurisdiction", "year")) %>%
                        mutate(private_prisons_pct = round(private_prisons/prison*100,4)) %>% 
                        select(year, state, private_prisons, private_prisons_pct)
private_prisons$jurisdiction <- NULL

kable_f(head(private_prisons))
year state private_prisons private_prisons_pct
2009 Alabama 883 3.2414
2010 Alabama 1,024 3.7447
2011 Alabama 545 2.0326
2012 Alabama 538 2.0099
2013 Alabama 554 2.0652
2014 Alabama 481 1.8397

Prison expenditures

The next dataset concerns federal expenditures on corrections. As with the number of prisoners and the private prisons data, it was converted from a wide panel form with long_panel.

prison_exp <- read_delim("data/prison_expenditures.csv", ";")
prison_exp <- long_panel(prison_exp, begin = 2009, end = 2016, prefix="_", label_location = "end", id = "jurisdiction")
names(prison_exp)[names(prison_exp) == "wave"] <- "year"
prison_exp$year <- as.factor(prison_exp$year)
prison_exp$jurisdiction <- NULL

kable_f(head(prison_exp))
year state prison_expenditure
2009 Alabama 711,538
2010 Alabama 734,776
2011 Alabama 723,829
2012 Alabama 711,823
2013 Alabama 698,906
2014 Alabama 713,432

State area and region

To enhance further visualisations, we add information about each state’s area and region, based on the us_states dataset from the spData package. Four basic regions are distinguished at this stage: Midwest, Northeast (spelled “Norteast” in the dataset), South and West.

library(spData)
library(sf)
us_states_info <- data.frame(jurisdiction = us_states$NAME, 
                             region = us_states$REGION,
                             area_km2 = as.numeric(round(us_states$AREA, 0)))

kable_f(head(us_states_info))
jurisdiction region area_km2
Alabama South 133,709
Arizona West 295,281
Colorado West 269,573
Connecticut Norteast 12,977
Florida South 151,052
Georgia South 152,725

Because two states are missing from the us_states_info dataset, we manually added the region and land area for Hawaii and Alaska 6.

additional_states <- data.frame(jurisdiction = c("Hawaii", "Alaska"),
           region = c("remote", "remote"),
           area_km2 = c(16638, 1481346))

us_states_info <- rbind(us_states_info, additional_states)

District of Columbia

The District of Columbia appears as a jurisdiction with its own rows in the prison_exp and ucr datasets; however, it does not appear in the prison or private_prisons datasets.

cat("State that is in ucr dataset but does not appear in prison dataset:\n",
    setdiff(ucr$jurisdiction %>% unique(), prison$jurisdiction %>% unique()))
## State that is in ucr dataset but does not appear in prison dataset:
##  District of Columbia
cat("State that is in ucr dataset but does not appear in private_prisons dataset:\n",
    setdiff(ucr$jurisdiction %>% unique(), private_prisons$state %>% unique()))
## State that is in ucr dataset but does not appear in private_prisons dataset:
##  District of Columbia

The District of Columbia is also problematic for our calculations because of its unique area, population, urbanisation and overall crime rates compared to the states. The table below compares the median values for the District of Columbia with the median values for the other states. It is immediately visible that the area of the District of Columbia is about 0.1% of the median area of the other states, so any crime rate per square km, regardless of the crime type, turns out far larger. Its population is also considerably smaller than the median, although measuring crime per capita is a much more accurate basis for comparison.

The plots below show crimes in the District of Columbia compared to the states. The nominal counts of both crime types are rather low, toward the end of the list of states. Values per area for both crime types are, as expected, drastically higher for Washington DC, because it is an entirely urbanised area comprising the country’s capital. However, even the values divided by population, which seems the more reasonable approach, show that the District of Columbia has the highest crime rates per capita. The difference is even larger for violent crimes.

This is not a surprising finding, since Washington DC has struggled with high crime rates, especially homicide, since the crack epidemic of the 1980s. It gained the nickname “murder capital” despite being home to the headquarters of many agencies. [^nyt] The aftermath can still be seen today. Bearing in mind the size of its crime rates, we decided to exclude the district from all further analyses, because keeping Washington DC would distort our conclusions due to the aforementioned gap between it and the states.

# population, violent_crime, property_crime
state.df.1 <- ucr %>% filter(jurisdiction!="District of Columbia") %>% 
                    summarise(Population = median(state_population), 
                              `Violent crimes` = median(violent_crime_total), 
                              `Property crimes` = median(property_crime_total)) %>% t()
dc.df.1 <- ucr %>% filter(jurisdiction=="District of Columbia") %>% 
                    summarise(population = median(state_population), 
                              `Violent crimes` = median(violent_crime_total), 
                              `Property crimes` = median(property_crime_total)) %>% t()

# area
state.df.2 <- us_states_info %>% filter(jurisdiction!="District of Columbia") %>% summarise(states = median(area_km2))
dc.df.2 <- us_states_info %>% filter(jurisdiction=="District of Columbia") %>% summarise(dc = area_km2)

#combined
dc.df <- rbind(data.frame(row.names=c("Area"), round(state.df.2,0), dc.df.2),
               data.frame("states"=round(state.df.1,0), "dc"=round(dc.df.1,0)))
dc.df$pct <- percent(dc.df$dc/dc.df$states)
colnames(dc.df) = c("Median values for other states", "Values for District of Columbia", "DC/avg states in pct")
kable_f(dc.df)
Median values for other states Values for District of Columbia DC/avg states in pct
Area 145,349 178 0.12%
Population 4,321,249 599,657 13.88%
Violent crimes 15,899 8,236 51.80%
Property crimes 130,969 30,211 23.07%
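The arithmetic behind this table, and the effect of the two normalisations, can be reproduced from the four rows shown. The sketch below is only a back-of-the-envelope illustration: it divides one median by another, which is not the same as the median of per-state rates, and the object names (`dc_table`, `rates`) are introduced here for illustration.

```r
# Values copied from the table: medians for the other states and values for DC.
dc_table <- data.frame(
  measure = c("Area", "Population", "Violent crimes", "Property crimes"),
  states  = c(145349, 4321249, 15899, 130969),
  dc      = c(178, 599657, 8236, 30211)
)

# DC value as a percentage of the median for the other states.
dc_table$pct <- round(100 * dc_table$dc / dc_table$states, 2)

crimes <- dc_table[dc_table$measure %in% c("Violent crimes", "Property crimes"), ]
area   <- dc_table[dc_table$measure == "Area", ]
pop    <- dc_table[dc_table$measure == "Population", ]

# Crimes per square km and per 100,000 inhabitants.
rates <- data.frame(
  measure         = crimes$measure,
  dc_per_km2      = crimes$dc / area$dc,
  states_per_km2  = crimes$states / area$states,
  dc_per_100k     = 1e5 * crimes$dc / pop$dc,
  states_per_100k = 1e5 * crimes$states / pop$states
)
print(dc_table)
print(rates)
```

With these figures, the per-area comparison gives roughly 46 violent crimes per square km for DC against about 0.1 for the median-based state value, a gap driven almost entirely by the tiny area. The per-capita gap is much smaller but still points in the same direction.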
library(ggrepel)
plot.data <- left_join(ucr, us_states_info, by="jurisdiction") %>% 
                      group_by(jurisdiction) %>% 
                      summarise(property_crime_total = mean(property_crime_total),
                                violent_crime_total = mean(violent_crime_total),
                                area_km2 = mean(area_km2),
                                state_population = mean(state_population))

plot.data$highlight <- ifelse(plot.data$jurisdiction=="District of Columbia", 1, 0)
plot.data <- plot.data %>% mutate(property_per_area = property_crime_total/area_km2,
                                  property_per_pop = property_crime_total/state_population,
                                  violent_per_area = violent_crime_total/area_km2,
                                  violent_per_pop = violent_crime_total/state_population)

ggp1 <- ggplot() + 
  geom_bar(data=plot.data, aes(x = reorder(jurisdiction, -property_crime_total), 
                               y = property_crime_total, 
                               fill = as.factor(highlight)), 
           stat="identity") + 
  scale_fill_discrete(name="Jurisdiction", 
                      labels = c("other", "District of Columbia")) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        legend.position = "bottom")

ggp2 <- ggplot() + 
  geom_bar(data=plot.data, aes(x = reorder(jurisdiction, -property_per_area), 
                               y = property_per_area, 
                               fill = as.factor(highlight)), 
           stat="identity") + 
  scale_fill_discrete(name="Jurisdiction", 
                      labels = c("other", "District of Columbia")) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        legend.position = "none")  

ggp3 <- ggplot() + 
  geom_bar(data=plot.data, aes(x = reorder(jurisdiction, -property_per_pop), 
                               y = property_per_pop, 
                               fill = as.factor(highlight)), 
           stat="identity") + 
  scale_fill_discrete(name="Jurisdiction", 
                      labels = c("other", "District of Columbia")) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        legend.position = "none")

ggp4 <- ggplot() +
  geom_bar(data=plot.data, aes(x = reorder(jurisdiction, -violent_crime_total),
                               y = violent_crime_total,
                               fill = as.factor(highlight)),
           stat="identity") +
  scale_fill_discrete(name="Jurisdiction",
                      labels = c("other", "District of Columbia")) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        legend.position = "none")  

ggp5 <- ggplot() +
  geom_bar(data=plot.data, aes(x = reorder(jurisdiction, -violent_per_area),
                               y = violent_per_area,
                               fill = as.factor(highlight)),
           stat="identity") +
  scale_fill_discrete(name="Jurisdiction",
                      labels = c("other", "District of Columbia")) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        legend.position = "none")  

ggp6 <- ggplot() +
  geom_bar(data=plot.data, aes(x = reorder(jurisdiction, -violent_per_pop),
                               y = violent_per_pop,
                               fill = as.factor(highlight)),
           stat="identity") +
  scale_fill_discrete(name="Jurisdiction",
                      labels = c("other", "District of Columbia")) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        legend.position = "none") 

mylegend<-g_legend(ggp1)

grid.arrange(arrangeGrob(ggp1 + theme(legend.position="none"),
                         ggp2 + theme(legend.position="none"),
                         ggp3 + theme(legend.position="none"),
                         ggp4 + theme(legend.position="none"),
                         ggp5 + theme(legend.position="none"),
                         ggp6 + theme(legend.position="none"),
                         nrow=2), mylegend, nrow=2, heights=c(10, 1))

ucr <- ucr %>% filter(jurisdiction!="District of Columbia")
prison_exp <- prison_exp %>% filter(state!="District of Columbia")

Background

The United States has the largest prison population and the highest per capita incarceration rate in the world (four times the world average) 7. According to the 2018 report of the Bureau of Justice Statistics (BJS), nearly 2.2 million adults were imprisoned in America at the end of 2016 8. That means that for every 100,000 people living in the US, about 655 were held in prisons and jails. Because of the sheer number of prisoners in the country, its expenditures on prisons are also the highest. According to recent surveys of United States expenditures, spending on incarceration has increased about three times as fast as spending on elementary and secondary education during this time period 9.

According to Travis, Western and Redburn 10, from 1973 to 2009 the state and federal prison populations grew steadily from about 200,000 to 1.5 million and started declining slightly in the following years. This can also be observed in the plot below, which presents the number of prisoners over the period 2001-2016: the number of prisoners grows from 2001 to 2009 and starts to decrease afterwards.

prison_year <- prison %>% group_by(year) %>% summarise(value = sum(prison))

p <- ggplot(data = prison_year, aes(x = year, y = value/1000000, color = year,  
                                          text = paste("Year: ", year,
                                          "<br>Number of prisoners:", comma(round(value), 0)))) +
  geom_point() +
  scale_color_viridis_d() +
  labs(title = "Number of prisoners in state and federal prison in the USA per year", 
       x = "Year", 
       y = "Number of prisoners (in millions)") +
  theme_minimal() +
  theme(legend.position = "none")
  
# the y axis is expressed in millions of prisoners, so no percent tick format is applied
ggplotly(p, tooltip = "text")

Average historical violent crimes vs. property crimes per population across states

The proportion of violent to property crimes is similar across regions, with violent crimes making up 10%-20% of the total number of crimes. Comparing the regions, the South has the highest share of violent crimes, while Hawaii and Alaska (the remote region) have the lowest. It is important to note that the remote region was created artificially and consists of only two drastically different states: further analysis shows that Hawaii has one of the lowest ratios of violent to property crimes, while Alaska has one of the highest.

The right plot below shows the average regional values, preserving the relation between regions. The South has the highest number of crimes per square km despite its large area.
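The share of violent crimes referred to here is simply the violent count divided by the sum of violent and property counts, which is also what a stacked bar chart normalised to 100% displays. A minimal sketch follows; the helper `violent_share` and the toy regions `R1` and `R2` are hypothetical, while the two median counts are taken from the District of Columbia comparison table earlier in this report.

```r
# Share of violent crimes among all (violent + property) crimes.
violent_share <- function(violent, property) {
  violent / (violent + property)
}

# Median state-level counts from the DC comparison table (15,899 and 130,969).
median_share <- violent_share(15899, 130969)
print(round(median_share, 3))

# Toy regional totals (hypothetical values) to show the per-region computation.
toy <- data.frame(
  region   = c("R1", "R2"),
  violent  = c(20, 10),
  property = c(80, 90)
)
toy$violent_share <- violent_share(toy$violent, toy$property)
print(toy)
```

For the median counts this gives a share of about 11%, which sits inside the 10%-20% range quoted above.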

library(reshape2)
plot.data <- ucr %>% left_join(us_states_info, by="jurisdiction") %>% 
          group_by(region) %>% 
          summarise(
            violent_crime_total = mean(violent_crime_total/area_km2),
            property_crime_total = mean(property_crime_total/area_km2)) %>%  
          melt()

ggp1 <- ggplot(data = plot.data, aes(x=region, y=value, fill=variable)) +
   geom_bar(stat="identity", width=.5, position = "fill") +
   theme_minimal() + 
   theme(legend.position = "bottom", axis.title.x = element_blank()) +
   scale_fill_viridis_d()

ggp2 <- ggplot(data = plot.data, aes(x=region, y=value, fill=variable)) +
   geom_bar(stat="identity", width=.5, position = "dodge") +
   theme_minimal() +
   theme(axis.title.x = element_blank()) +
   scale_fill_viridis_d()

mylegend<-g_legend(ggp1)

grid.arrange(arrangeGrob(ggp1 + theme(legend.position="none"),
                         ggp2 + theme(legend.position="none"),
                         nrow=1), mylegend, nrow=2, heights=c(10, 1))

The maps below show US states by crime type, violent versus property crimes. For property crimes there is a visible division that aligns with the historical North and South of the Civil War: there are more property crimes in the South than in the Northeast. When looking at north and south in a purely geographical sense (dividing the country horizontally), there is a difference for both types of crimes, once again with higher rates in the south. Moreover, comparing the two maps reveals a difference in the Northeast region concerning New York, Pennsylvania, New Jersey and Massachusetts: relative to other states, these states rank higher on violent crimes per population than on property crimes per population. One more state in which this difference can be seen is Idaho, in the West.

#create df with mean values across years per state from ucr
ucr_grouped <- ucr %>% 
                  group_by(jurisdiction) %>% 
                  summarise(violent_crime_total = mean(violent_crime_total),
                            property_crime_total = mean(property_crime_total))
#rename variable for merging
names(ucr_grouped)[names(ucr_grouped) == "jurisdiction"] <- "NAME"
#merge grouped ucr and state spatial data
us_states_ucr <- merge(us_states, ucr_grouped, by = "NAME")

#create values per population
us_states_ucr$violent_crime_per_pop <- us_states_ucr$violent_crime_total/us_states_ucr$total_pop_15
us_states_ucr$property_crime_per_pop <- us_states_ucr$property_crime_total/us_states_ucr$total_pop_15

us_states_midwest <- us_states %>% 
                        filter(REGION=="Midwest") %>% 
                        st_union() %>% 
                        cbind(data.frame(REGION="Midwest")) %>% 
                        st_sf()
us_states_norteast <- us_states %>% 
                        filter(REGION=="Norteast") %>% 
                        st_union() %>% 
                        cbind(data.frame(REGION="Norteast")) %>% 
                        st_sf()
us_states_south <- us_states %>% 
                        filter(REGION=="South") %>% 
                        st_union() %>% 
                        cbind(data.frame(REGION="South")) %>% 
                        st_sf()
us_states_west <- us_states %>% 
                        filter(REGION=="West") %>% 
                        st_union() %>% 
                        cbind(data.frame(REGION="West")) %>% 
                        st_sf()
us_states_regions <- rbind(us_states_midwest, us_states_norteast, us_states_south, us_states_west) %>% st_sf()

# create usa map for both crime types
usa1 <- ggplot() +
          geom_sf(data = us_states_ucr, aes(fill = property_crime_per_pop), lwd = 0) +
          scale_fill_viridis_c(option = "viridis", trans = "sqrt") +
          theme(legend.position = "none") +
          theme_minimal()
usa2 <- ggplot(data = us_states_ucr) +
          geom_sf(data = us_states_ucr, aes(fill = violent_crime_per_pop), lwd = 0) +
          scale_fill_viridis_c(option = "viridis", trans = "sqrt") +
          theme(legend.position = "none") +
          theme_minimal()

# format main map
usa_all1 <- usa1 + 
              ggtitle("Property crimes per population across states")+
              theme(legend.position = "right") +
              geom_sf(data = us_states_regions, aes(color=REGION), alpha=0, size = 0.6) +
              scale_color_manual(values = heat.colors(6)[2:5])

usa_all2 <- usa2 + 
              ggtitle("Violent crimes per population across states")+
                            theme(legend.position = "right") +
              geom_sf(data = us_states_regions, aes(color=REGION), alpha=0, size = 0.6) +
              scale_color_manual(values = heat.colors(6)[2:5])

# zoom and format zoomed map of DC
usa_dc1 <- usa1 + 
            coord_sf(xlim = c(-79, -75), ylim = c(38, 40)) +
            guides(fill="none")+
            theme(axis.title = element_blank(), 
                  axis.text  = element_blank(),
                  axis.ticks = element_blank(),
                  legend.position = "none")
usa_dc2 <- usa2 + 
            coord_sf(xlim = c(-79, -75), ylim = c(38, 40)) +
            guides(fill="none")+
            theme(axis.title = element_blank(), 
                  axis.text  = element_blank(),
                  axis.ticks = element_blank(),
                  legend.position = "none")

# combine both plots and add black rectangle around zoomed area
ggp1 <- usa_all1 + 
          annotation_custom(ggplotGrob(usa_dc1), xmin= -80, ymax= 35)+
          geom_rect(aes(xmin = -79, xmax= -75, ymin=38, ymax = 40), size=0.6, fill=NA, color="black")

ggp2 <- usa_all2 + 
          annotation_custom(ggplotGrob(usa_dc2), xmin= -80, ymax= 35)+
          geom_rect(aes(xmin = -79, xmax= -75, ymin=38, ymax = 40), size=0.6, fill=NA, color="black")

Property crimes per population

Violent crimes per population

Statistical analysis of the dataset

Having introduced general information about incarceration and crimes, we move to a more detailed analysis of our data and examining the relationships between features.

The first relationship we want to examine is the one between the number of prisoners currently incarcerated and the number of crimes over time. Studies suggest that a higher rate of imprisonment could decrease the number of crimes through the isolation of sentenced individuals, their resocialisation, or its deterrent effect on potential criminals. The magnitude of the change cannot be easily predicted, but some studies claim that the reduction in crime diminishes at the margin as the incarceration rate rises 11. Here we also compare the number of prisoners and crimes across states.

The second relationship we want to uncover is the impact of investment in imprisonment on public safety, quantified by the number of different kinds of crimes. We want to see whether higher expenditure per prisoner improves public safety, taking into account the number of prisoners per state incarcerated in private prisons.

Furthermore, we will cluster the states by crime rates for different types of crimes to check whether the resulting clusters correspond to geography.

Prisons and crime

total bubble plot?

# Join crime counts with state metadata once and reuse the result in both plots.
# Going through as.character() keeps the calendar year even if `year` is a factor.
bubble_data <- left_join(ucr, us_states_info, by = "jurisdiction") %>%
  mutate(year = as.integer(as.character(year)))

# Violent crimes vs. population; bubble size encodes the area of the state.
ggp1 <- ggplot(bubble_data,
               aes(x = violent_crime_total,
                   y = state_population,
                   colour = as.factor(jurisdiction),
                   size = area_km2)) +
  geom_point(show.legend = FALSE, alpha = 0.5) +
  scale_color_viridis_d() +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  labs(x = "Violent crimes total", y = "State population")

# Property crimes vs. population, built the same way.
ggp2 <- ggplot(bubble_data,
               aes(x = property_crime_total,
                   y = state_population,
                   colour = as.factor(jurisdiction),
                   size = area_km2)) +
  geom_point(show.legend = FALSE, alpha = 0.5) +
  scale_color_viridis_d() +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  labs(x = "Property crimes total", y = "State population")

# Animate over years; `range` is an argument of transition_time(), not of labs().
bubbles_v <- ggp1 + transition_time(year, range = c(2001L, 2017L)) +
  labs(title = "Year: {frame_time}")

bubbles_p <- ggp2 + transition_time(year, range = c(2001L, 2017L)) +
  labs(title = "Year: {frame_time}")

# anim_save("bubbles_violent.gif", bubbles_v)
# anim_save("bubbles_property.gif", bubbles_p)

per state map animated

Investments into prisons and the rate of crimes

Number of prisoners and crimes across states

State clustering based on crime rates

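# Keep the 2017 observations and drop the aggregate and bookkeeping columns.
data_to_cluster_1 <- ucr %>%
  dplyr::select(-c(crime_reporting_change, crimes_estimated,
                   violent_crime_total, property_crime_total)) %>%
  filter(year == 2017)

# Convert raw counts to per-capita rates so that states of different sizes are comparable.
data_to_cluster_1 <- data_to_cluster_1 %>%
  mutate(murder_manslaughter_per_pop = murder_manslaughter / state_population,
         robbery_per_pop = robbery / state_population,
         agg_assault_per_pop = agg_assault / state_population,
         burglary_per_pop = burglary / state_population,
         larceny_per_pop = larceny / state_population,
         vehicle_theft_per_pop = vehicle_theft / state_population)

# Use the state names as row names so that they label the heatmap and the dendrogram.
rownames(data_to_cluster_1) <- data_to_cluster_1$jurisdiction

# Drop the identifiers and the raw counts that were converted to rates.
data_to_cluster_1 <- data_to_cluster_1 %>%
  dplyr::select(-c(year, jurisdiction, state_population, murder_manslaughter,
                   robbery, agg_assault, burglary, larceny, vehicle_theft))

# Standardise every column to mean 0 and standard deviation 1.
data_to_cluster_1 <- scale(data_to_cluster_1)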
library(heatmaply)
# Pairwise distances between the standardised state profiles (Euclidean by default).
distance <- get_dist(data_to_cluster_1)
distance <- as.matrix(distance)
heatmaply(distance,
          cellnote_size = 8, fontsize_row = 6, fontsize_col = 7,
          Rowv = FALSE, Colv = FALSE,
          main = 'Clustering of states based on rates of different kinds of crimes')
# Hierarchical clustering of the states, cut into three clusters.
res.hc <- eclust(data_to_cluster_1, FUNcluster = "hclust", k = 3)
fviz_dend(res.hc,
          rect = TRUE,
          k_colors = viridis(5)[2:4],
          rect_border = "grey",
          main = "Clustering of states based on rates of different kinds of crimes",
          cex = 0.7,
          ggtheme = theme_minimal())

# Cluster membership per state, for joining back to the other tables.
clusters <- data.frame(jurisdiction = rownames(data_to_cluster_1),
                       cluster = res.hc$cluster)

rownames(clusters) <- NULL
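The clustering steps above can be reproduced in base R on a toy matrix, which makes it easier to see what each stage does. The sketch assumes that `eclust()` with `"hclust"` is left at Euclidean distance and Ward linkage (`"ward.D2"`). Both are stated explicitly here and are an assumption about the configuration rather than something confirmed in the analysis. The six states `S1` to `S6` and their rates are invented.

```r
# Toy per-capita crime rates for six hypothetical states (invented values).
rates <- matrix(
  c(0.00002, 0.0005, 0.010,
    0.00003, 0.0006, 0.011,
    0.00008, 0.0015, 0.020,
    0.00009, 0.0016, 0.021,
    0.00005, 0.0010, 0.030,
    0.00005, 0.0011, 0.031),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("S", 1:6), c("murder", "robbery", "larceny"))
)

# 1. Standardise each column so that frequent crimes (larceny) do not dominate rare ones (murder).
scaled <- scale(rates)

# 2. Pairwise Euclidean distances between the standardised state profiles.
d <- dist(scaled, method = "euclidean")

# 3. Agglomerative clustering with Ward linkage (assumed configuration, see above).
hc <- hclust(d, method = "ward.D2")

# 4. Cut the tree into three clusters and list the membership of each state.
groups <- cutree(hc, k = 3)
clusters_toy <- data.frame(jurisdiction = names(groups), cluster = unname(groups))
print(clusters_toy)
```

Without the `scale()` step the larceny column, which is several orders of magnitude larger than the murder column, would determine the distances almost entirely.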

[^prisoners]: Source: https://www.bjs.gov/index.cfm?ty=dcdetail&iid=269

[^nyt]: Source: https://www.nytimes.com/2006/07/13/us/13deecee.html?_r=1&n=Top%2FReference%2FTimes%20Topics%2FPeople%2FW%2FWilliams%2C%20Anthony%20A.&mtrref=en.wikipedia.org&assetType=REGIWALL

Working notes:

library(reshape2)
# Average crime density (crimes per square kilometre) for each state, in long format for plotting.
plot.data <- ucr %>%
  left_join(us_states_info, by = "jurisdiction") %>%
  group_by(jurisdiction) %>%
  summarise(
    violent_crime_total = mean(violent_crime_total / area_km2),
    property_crime_total = mean(property_crime_total / area_km2)) %>%
  melt(id.vars = "jurisdiction")
ggp1 <- ggplot(data = plot.data, aes(x=jurisdiction, y=value, fill=variable)) +
   geom_bar(stat="identity", width=.5, position = "fill") +
   theme_minimal() +
   theme(legend.position = "bottom", 
         axis.text.x = element_text(angle = 90)) +
   scale_fill_viridis_d()
ggp1

ggp2 <- ggplot(data = plot.data, aes(x=jurisdiction, y=value, fill=variable)) +
   geom_bar(stat="identity", width=.5, position = "dodge") +
   theme_minimal() +
   theme(legend.position = "bottom",
         axis.text.x = element_text(angle = 90)) +
   scale_fill_viridis_d()
ggp2

https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/

What is the relationship between the number of prisoners (prison) and the occurrences of particular crimes over the years (ucr)? Does an increase in the number of people imprisoned reduce the rate of any type of crime? Or is there a steady increase or decrease in crime? (geom line and geom smooth)

Does this significant investment into imprisonment improve public safety? Prison expenditures versus crime occurrences, overall and by category, in 2016 (the most recent data); source: https://www.bjs.gov/index.cfm?ty=dcdetail&iid=286

How does the number of prisoners change over the years in individual states?

Clustering of states based on the number of different types of crime (hierarchical or similar).


  1. Source: https://www.kaggle.com/christophercorrea/prisoners-and-crime-in-united-states#ucr_by_state.csv.

  2. Source: https://www.bjs.gov/index.cfm?ty=dcdetail&iid=269.

  3. Source: https://www.ucrdatatool.gov/Search/Crime/State/RunCrimeStatebyState.cfm.

  4. Source: https://www.bjs.gov/index.cfm?ty=dcdetail&iid=286.

  5. “For agencies supplying 3 to 11 months of data, the national UCR Program estimates for the missing data by following a standard estimation procedure using the data provided by the agency. If an agency has supplied less than 3 months of data, the FBI computes estimates by using the known crime figures of similar areas within a state and assigning the same proportion of crime volumes to nonreporting agencies.” (cited from https://www.ucrdatatool.gov/faq.cfm).

  6. Sources: https://en.wikipedia.org/wiki/Alaska and https://en.wikipedia.org/wiki/Hawaii.

  7. “US Rates of Incarceration: A Global Perspective”, Christopher Hartney, Research from the National Council on Crime and Delinquency, November 2006, https://www.nccdglobal.org/sites/default/files/publication_pdf/factsheet-us-incarceration.pdf.

  8. “Correctional Populations in the United States”, Danielle Kaeble and Mary Cowhig, Bureau of Justice Statistics, 2016, https://www.bjs.gov/content/pub/pdf/cpus16.pdf.

  9. Source: https://www.ed.gov/news/press-releases/report-increases-spending-corrections-far-outpace-education.

  10. “The Growth of Incarceration in the United States: Exploring Causes and Consequences”, Jeremy Travis, Bruce Western and Steve Redburn, Committee on Law and Justice, Washington, DC 2014, https://johnjay.jjay.cuny.edu/nrc/NAS_report_on_incarceration.pdf.

  11. “The Growth of Incarceration in the United States: Exploring Causes and Consequences”, Jeremy Travis, Bruce Western and Steve Redburn, Committee on Law and Justice, Washington, DC 2014, https://johnjay.jjay.cuny.edu/nrc/NAS_report_on_incarceration.pdf.